Constraint-Based Entity Matching
Abstract
Entity matching is the problem of deciding if two given mentions in the data, such as “Helen Hunt” and “H. M. Hunt”, refer to the same real-world entity. Numerous solutions have been developed, but they have not considered in depth the problem of exploiting integrity constraints that frequently exist in the domains. Examples of such constraints include “a mention with age two cannot match a mention with salary 200K” and “if two paper citations match, then their authors are likely to match in the same order”. In this paper we describe a probabilistic solution to entity matching that exploits such constraints to improve matching accuracy. At the heart of the solution is a generative model that takes into account the constraints during the generation process, and provides well-defined interpretations of the constraints. We describe a novel combination of EM and relaxation labeling algorithms that efficiently learns the model, thereby matching mentions in an unsupervised way, without the need for annotated training data. Experiments on several real-world domains show that our solution can exploit constraints to significantly improve matching accuracy, by 3-12% F-1, and that the solution scales up to large data sets.
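The first example constraint in the abstract can be illustrated with a minimal sketch. This is not the paper's method: the paper describes a probabilistic generative model learned with EM and relaxation labeling, whereas the sketch below only shows a hard constraint vetoing a similarity-based match. The field names (`name`, `age`, `salary`), the cutoffs, the 0.6 threshold, and all function names are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (stand-in for any mention score)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def violates_age_salary(m1: dict, m2: dict) -> bool:
    """Hypothetical hard constraint: a mention with age <= 2 cannot
    match a mention with salary >= 200K (cutoffs are assumed)."""
    for child, earner in ((m1, m2), (m2, m1)):
        age = child.get("age")
        salary = earner.get("salary")
        if age is not None and salary is not None:
            if age <= 2 and salary >= 200_000:
                return True
    return False


def is_match(m1: dict, m2: dict, threshold: float = 0.6) -> bool:
    """Declare a match only if no constraint is violated and names are similar enough."""
    if violates_age_salary(m1, m2):
        return False  # the constraint vetoes the pair regardless of name similarity
    return name_similarity(m1["name"], m2["name"]) >= threshold


if __name__ == "__main__":
    toddler = {"name": "Helen Hunt", "age": 2}
    executive = {"name": "Helen Hunt", "salary": 200_000}
    print(is_match(toddler, executive))  # identical names, but the constraint blocks the match
```

A hard veto like this is the simplest way to use a constraint; the abstract's second example ("authors are likely to match in the same order") is a soft constraint, which a filter of this kind cannot express.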
Similar resources
A Constraint Language Approach to Grid Resource Selection
The need to discover and select entities that match specified requirements arises in many contexts in distributed systems. Meeting this need is complicated by the fact that not only may the potential consumer specify constraints on resources, but the owner of the entity in question may specify constraints on the consumer. This observation has motivated Raman et al. to propose that discovery and...
A Hierarchical Image Matching Method for Stereo Satellite Imagery
Image matching is an essential and difficult task in digital photogrammetry and computer vision. This paper presents a triangulation-based hierarchical image matching algorithm for stereo satellite imagery. It uses a coarse-to-fine hierarchical strategy and combines feature points and grid points to provide a dense, precise and reliable matching result. First, some seed points are extracted at t...
Adaptive Approximate Record Matching
Typographical data entry errors and incomplete documents produce imperfect records in real-world databases. These errors generate distinct records that belong to the same entity. The aim of Approximate Record Matching is to find the multiple records that belong to one entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error...
Multi-level NER for Portuguese in a CG Framework
This paper describes and evaluates a linguistically based NER system for Portuguese, based on lexico-semantic information, pattern matching, and morphosyntactic, context-driven Constraint Grammar rules. Preliminary F-scores for cross-domain news texts, when distinguishing six different name types, were 91.85 (raw) and 93.6 (subtyping of ready-chunked proper nouns).
Constraint-Based Reasoning in Geographic Databases: the Case of Symbolic Arrays
Symbolic arrays are hierarchical constraint-based representations that preserve direction relations (e.g., north, northeast) among the distinct components of complex spatial entities. They have been used in problems involving pattern matching and spatial information retrieval. In this paper we demonstrate how inference can be achieved in geographic databases of symbolic arrays using composition...
A Named Entity Recognizer for Danish
This paper describes how a preexisting Constraint Grammar-based parser for Danish (DanGram, Bick 2002) has been adapted and semantically enhanced to accommodate named entity recognition (NER), using rule-based and lexical, rather than probabilistic, methodology. The project is part of a multilingual Nordic initiative, Nomen Nescio, which targets 6 primary name types (human, organis...
Publication date: 2005